Flow Matching For Generative Modeling

This is a learning note from this series of videos.

Link to the paper: https://arxiv.org/abs/2210.02747

Supplementary papers:

Step-by-Step Diffusion: An Elementary Tutorial

Glow: Generative Flow with Invertible 1x1 Convolutions

1. What is a Flow? Understanding Vector Fields and ODEs

Generative Modeling: With known samples $x_1 \sim q_1$ (an unknown distribution), we want to estimate this unknown distribution.

Method: $\underbrace{q_0}_{\text{a known distribution}} \xrightarrow{\phi} \underbrace{q_1}_{\text{unknown distribution to be estimated}}$. This mapping is denoted as $\phi$.

How to solve for $\phi$: 1. Normalizing Flow; 2. Flow Matching (ODE)

Keypoints of Flow Matching:

  1. Let $q_0$ and $q_1$ be the distributions of the initial and final points of an ODE trajectory.
  1. Use a neural network to fit the vector field (the right-hand side) of the ODE; see the objective below.
  1. Solve the ODE to map samples from one distribution to the other.
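
Concretely, point 2 is the Flow Matching objective from the paper: regress a neural vector field $v_t(x; \theta)$ onto a target vector field $u_t(x)$ that generates the desired probability path $p_t$:

\begin{equation*} \mathcal{L}_{\text{FM}}(\theta) = \mathbb{E}_{t \sim \mathcal{U}[0,1],\, x \sim p_t(x)} \left\| v_t(x; \theta) - u_t(x) \right\|^2 \end{equation*}

(In practice $u_t$ is intractable, so the paper regresses against tractable conditional vector fields instead; the regression form is unchanged.)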

Preliminaries

ODE → Flow → Normalizing Flow → Continuous Normalizing Flow

Flow

Definition 1 (Flow). A flow is a collection of time-indexed vector fields $v = \{ v_t \}_{t \in [0, 1]}$.

Any flow defines a trajectory taking initial points $x_1$ to final points $x_0$ by transporting the initial point along the velocity fields $\{ v_t \}$. Equivalently, it defines a transport between two distributions.

Formally, for velocity field $v$ and initial point $x_1$, consider the ODE

\begin{equation} \frac{d x_t}{dt} = -v_t(x_t) \end{equation}

with initial condition $x_1$ at time $t = 1$. We write

$x_t \coloneqq \text{RunFlow}(v, x_1, t)$

to denote the solution to the flow ODE at time $t$, terminating at final point $x_0$. That is, $\text{RunFlow}$ is the result of transporting point $x_1$ along the flow $v$ up to time $t$.
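
To make $\text{RunFlow}$ concrete, here is a minimal sketch using forward-Euler integration. This is an illustrative assumption, not the tutorial's actual implementation: `v` is any callable `v(t, x)`, and the step count is arbitrary.

```python
import numpy as np

def run_flow(v, x1, t_end=0.0, n_steps=100):
    """Euler-integrate dx/dt = -v_t(x) from t = 1 down to t_end, starting at x1."""
    x = np.asarray(x1, dtype=float)
    ts = np.linspace(1.0, t_end, n_steps + 1)
    for t_cur, t_next in zip(ts[:-1], ts[1:]):
        dt = t_next - t_cur          # negative, since time runs from 1 toward 0
        x = x + dt * (-v(t_cur, x))  # one Euler step of dx/dt = -v_t(x)
    return x
```

In practice any black-box ODE solver (e.g., Runge-Kutta) can replace the Euler loop.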

Flows also define transports between entire distributions by pushing forward points from the source distribution along their trajectories. If $p_1$ is a distribution on initial points, then applying the flow $v$ yields the distribution on final points:

$p_0 = \{ \text{RunFlow}(v, x_1, t=0) \}_{x_1 \sim p_1}$

This process is denoted as $p_1 \xrightarrow{v} p_0$, meaning the flow $v$ transports initial distribution $p_1$ to final distribution $p_0$.
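
With the sketch above, pushing an entire distribution forward is just running the flow on a batch of samples; the velocity field below is a toy stand-in chosen only for illustration.

```python
rng = np.random.default_rng(0)
x1 = rng.standard_normal((1024, 2))   # samples from p_1 (standard Gaussian)
x0 = run_flow(lambda t, x: -x, x1)    # x0 approximates samples from p_0
```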

THE ULTIMATE GOAL OF FLOW MATCHING is to learn a velocity field $v^*$ that transports $p_1 \xrightarrow{v^*} p_0$, where $p_0$ is the target distribution and $p_1$ is some easy-to-sample base distribution (such as a Gaussian).

Continuous Normalizing Flows

Let $\mathbb{R}^d$ denote the data space with data points $x = (x^1, \dots, x^d) \in \mathbb{R}^d$.

The probability density path $p: [0,1] \times \mathbb{R}^d \rightarrow \mathbb{R}_{>0}$ is a time-dependent probability density function, i.e., $\int p_t(x)\, dx = 1$.

$v: [0,1] \times \mathbb{R}^d \rightarrow \mathbb{R}^d$ is a time-dependent vector field.

A vector field $v_t$ can be used to construct a time-dependent diffeomorphic map, called a flow, $\phi: [0,1] \times \mathbb{R}^d \rightarrow \mathbb{R}^d$, defined via the ordinary differential equation (ODE):

\begin{align} \frac{d}{dt} \phi_t(x) &= v_t (\phi_t(x)) \\ \phi_0(x) &= x \end{align}

Here $\phi_t(x)$ is a solution to the ODE, and we call it a flow. We can model the vector field $v_t$ with a neural network $v_t(x; \theta)$, where $\theta \in \mathbb{R}^p$ are its learnable parameters.
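
A minimal sketch of this parameterization, assuming PyTorch and a plain MLP (the paper does not prescribe this architecture; for images it uses a U-Net):

```python
import torch
import torch.nn as nn

class VectorField(nn.Module):
    """Neural vector field v_t(x; theta): maps (t, x) in [0,1] x R^d to R^d."""
    def __init__(self, d, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d + 1, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, d),
        )

    def forward(self, t, x):
        # t: (batch,), x: (batch, d); time enters as an extra input feature
        return self.net(torch.cat([t[:, None], x], dim=-1))
```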

2. The Continuity Equation and Fokker-Planck Equation

Continuity Equation

How can we test whether a vector field $v_t$ generates a probability path $p_t$?

📖

Cédric Villani. Optimal transport: old and new, volume 338. Springer, 2009.

The following PDE provides a necessary and sufficient condition to ensure that a vector field $v_t$ generates a probability path $p_t$:

\begin{equation} \frac{d}{dt} p_t(x) + \text{div} (p_t(x) v_t(x)) = 0 \end{equation}

Here the divergence operator $\text{div}$ is defined with respect to the spatial variable $x = (x^1, \dots, x^d)$, i.e., $\text{div} = \sum_{i=1}^d \frac{\partial}{\partial x^i}$.
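
As a worked example of this condition (cf. Theorem 3 of the paper), take the Gaussian path $p_t(x) = \mathcal{N}(x \mid \mu_t, \sigma_t^2 I)$. The vector field

\begin{equation*} v_t(x) = \frac{\dot{\sigma}_t}{\sigma_t} (x - \mu_t) + \dot{\mu}_t \end{equation*}

satisfies the continuity equation with this $p_t$, i.e., it generates the Gaussian path.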

Conditional VFs for Fokker-Planck probability paths

Consider a Stochastic Differential Equation (SDE) of the standard form:

\begin{equation} dy = f_t\, dt + g_t\, dw \end{equation}

with time parameter $t$, drift $f_t$, diffusion coefficient $g_t$, and Wiener process $w$.

The solution to the SDE is $y_t$, a stochastic process (a continuous-time random variable). Its probability density $p_t(y_t)$ is characterized by the Fokker-Planck equation:

\begin{equation} \frac{dp_t}{dt} = -\text{div}(f_t p_t) + \frac{g_t^2}{2} \Delta p_t \end{equation}

where $\Delta$ represents the Laplace operator (in $y$), namely $\text{div}\, \nabla$, where $\nabla$ is the gradient operator. We can rewrite the above equation in the form of the continuity equation:

\begin{equation} \begin{aligned} \frac{dp_t}{dt} &= -\text{div}(f_t p_t) + \text{div}\left(\frac{g_t^2}{2} \nabla p_t\right) \\ &= -\text{div}\left(f_t p_t - \frac{g_t^2}{2} \frac{\nabla p_t}{p_t} p_t\right) \\ &= -\text{div}\left(\left(f_t - \frac{g_t^2}{2} \nabla \log p_t\right) p_t\right) \\ &= -\text{div}(w_t p_t) \end{aligned} \end{equation}

where the vector field

$w_t = f_t - \frac{g_t^2}{2} \nabla \log p_t$

satisfies the continuity equation with the probability path $p_t$, and therefore generates $p_t$.
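
As a concrete instance: for the variance-preserving SDE of score-based diffusion models (Song et al., 2021), where $f_t = -\frac{1}{2} \beta_t y$ and $g_t = \sqrt{\beta_t}$, this gives

\begin{equation*} w_t = -\frac{\beta_t}{2} \left( y + \nabla \log p_t \right) \end{equation*}

which is exactly the drift of the probability flow ODE used in that literature.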

3. Continuous Normalizing Flow (CNF)

A CNF is used to reshape a simple prior density $p_0$ (e.g., pure noise) into a more complicated one, $p_1$, via the push-forward equation. (Note the convention flip relative to Section 1: following the paper, time now runs from the prior $p_0$ at $t = 0$ to the data $p_1$ at $t = 1$.)

\begin{equation} p_t = [\phi_t]_* p_0 \end{equation}

The push-forward (or change-of-variables) operator $*$ is defined by:

\begin{equation} [\phi_t]_* p_0(x) = p_0(\phi_t^{-1}(x)) \det \left[ \frac{\partial \phi_t^{-1}}{\partial x}(x) \right] \end{equation}
📖

Change of variables in the probability density function

Suppose $\bold{x}$ is an $n$-dimensional random variable with joint density $f$. If $\bold{y} = G(\bold{x})$, where $G$ is a bijective, differentiable function, then $\bold{y}$ has density $p_{\bold{Y}}$:

$p_{\bold{Y}}(\bold{y}) = f\left( G^{-1}(\bold{y}) \right) \left| \det \left[ \frac{d G^{-1}(\bold{y})}{d \bold{y}} \right] \right|$
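
A quick one-dimensional instance of this formula: for the scaling $G(x) = \sigma x$ with $\sigma > 0$, we have $G^{-1}(y) = y / \sigma$ and $\left| \det \left[ dG^{-1}/dy \right] \right| = 1 / \sigma$, so $p_Y(y) = f(y / \sigma) / \sigma$. Pushing a standard Gaussian through $G$ therefore yields $\mathcal{N}(0, \sigma^2)$.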

Relationships between Flow, Vector Field, and Probability Density

  1. Equations (2) and (3) describe the relationship between the flow $\phi_t(x)$ and the vector field $v_t(x)$.
  1. Equations (8) and (9) describe how the flow $\phi_t(x)$ changes the density from $p_0$ to $p_t$.
  1. Equation (4) (the continuity equation) gives a necessary and sufficient condition to test whether a vector field $v_t$ generates a probability path $p_t$.